feat: add native implementations of `regexp_extract` and `regexp_extract_all` by andygrove · Pull Request #4146 · apache/datafusion-comet

andygrove · 2026-04-29T21:59:43Z

Which issue does this PR close?

Rationale for this change

regexp_extract and regexp_extract_all are common Spark SQL string functions used to pull substrings out of input strings via regex capture groups. Adding native support lets queries that use them stay in Comet instead of falling back to Spark.

What changes are included in this PR?

regexp_extract:

New Rust UDF spark_regexp_extract in native/spark-expr/src/string_funcs/regexp_extract.rs, backed by the regex crate. Handles Utf8 and LargeUtf8 inputs (array and scalar), idx defaults to 1, idx=0 returns the whole match, no match returns the empty string, an unmatched optional group returns the empty string, null input returns null, and an out-of-range idx returns an execution error.
Registration of regexp_extract in comet_scalar_funcs.rs.
New CometRegExpExtract serde mapping RegExpExtract to the native UDF. Reported as Incompatible because the Rust regex engine has different semantics from Java's regex engine (POSIX classes, look-around, possessive quantifiers, etc.). Users opt in via spark.comet.expression.RegExpExtract.allowIncompatible=true. Falls back when the pattern or idx is non-literal.

regexp_extract_all:

New Rust UDF spark_regexp_extract_all in native/spark-expr/src/string_funcs/regexp_extract_all.rs, paralleling the regexp_extract UDF. Returns List<Utf8> containing the selected capture group across every non-overlapping match of the pattern.
Registration of regexp_extract_all in comet_scalar_funcs.rs.
New CometRegExpExtractAll serde mapping RegExpExtractAll to the native UDF. Also reported as Incompatible for the same regex-engine reasons; gated on spark.comet.expression.RegExpExtractAll.allowIncompatible=true. Falls back when the pattern or idx is non-literal.

How are these changes tested?

9 Rust unit tests in regexp_extract.rs covering basic group extraction, idx=0/default idx, null subject, null pattern, unmatched optional group, out-of-range idx, negative idx, and invalid regex.
11 Rust unit tests in regexp_extract_all.rs covering the analogous cases for the array-returning variant.
Four new Comet SQL test files:
- regexp_extract.sql / regexp_extract_all.sql verify that each expression falls back to Spark by default.
- regexp_extract_enabled.sql / regexp_extract_all_enabled.sql exercise the happy paths with allowIncompatible=true, including default and explicit idx, idx=0, no-match, null input, optional unmatched groups, anchors, and all-literal expressions. Run under a ConfigMatrix for both dictionary-encoded and plain Parquet input.

Implement regexp_extract using the Rust regex crate. The expression is marked Incompatible because the Rust regex engine differs from the Java engine that Spark uses; users must opt in via spark.comet.expression.RegExpExtract.allowIncompatible=true.

Audit follow-ups: - Align Rust error messages with Spark's `INVALID_PARAMETER_VALUE` templates so `expect_error` substrings can match both engines. - Override `getUnsupportedReasons` in `CometRegExpExtract` so the non-literal pattern and non-literal idx reasons are picked up by the Compatibility Guide generator. - Add Comet SQL test cases for: NULL pattern and NULL idx, idx=0 with no capture groups, multibyte / Unicode subjects, idx out of range, pattern with no groups + idx>=1, negative idx, invalid regex syntax, and a Java-only lookahead that Rust regex rejects (marked `ignore`). - Add fallback test cases for non-literal pattern and non-literal idx. - Mark the expression supported in `spark_expressions_support.md` with per-version audit notes.

Address review feedback: - Make `extract_array` build a `GenericStringBuilder<O>` matching the input offset size so a `LargeUtf8` subject no longer silently outputs `Utf8` (avoids potential i32-offset overflow on >2GB inputs). - Inline group extraction so the per-row `String` allocation is gone; the only remaining `to_string` is on the rare scalar code path. - Replace the manual append-null loop in `null_result` with `StringArray::new_null(n)`. - Borrow the pattern as `&str` instead of cloning it before calling `Regex::new`. - Pass `failOnError = false` to the proto, matching `CometStringSplit`. The Rust UDF does not branch on this flag, so `true` was misleading.

The test data has no duplicate rows, so the parquet.enable.dictionary matrix produces two identical runs.

Adds a native Rust UDF `spark_regexp_extract_all` and a `CometRegExpExtractAll` serde, paralleling the existing `regexp_extract` support. Returns `List<Utf8>` containing the matched group across every non-overlapping match of the pattern. Reported as Incompatible because the Rust regex engine differs from Java's; gated on `spark.comet.expression.RegExpExtractAll.allowIncompatible=true`. Falls back when the pattern or `idx` is non-literal.

# Conflicts: # spark/src/main/scala/org/apache/comet/serde/strings.scala

mbutrovich · 2026-05-28T20:15:32Z

Thanks for tackling these @andygrove, the support-level handling and the SQL test shape (including the explicit ignore for lookahead) look great.

A few things I wanted to ask about:

Could we lift this into a `CometRegExpBase`?

CometRLike, CometRegExpReplace, and the two new ones all carry the same Rust-vs-Java caveat, and all four take a literal pattern as their first regex arg. RLike and RegExpReplace already do two things the new pair doesn't:

They use the shared regexp allow-incompat key (CometConf.isExprAllowIncompat("regexp")), so the user opts in once for the whole family. The new PR introduces per-class RegExpExtract.allowIncompatible / RegExpExtractAll.allowIncompatible flags, which diverges.
They call RegExp.isSupportedPattern(pattern) to upgrade safe patterns to Compatible rather than always being Incompatible.

Worth folding all four into a common base?

abstract class CometRegExpBase[T <: Expression] extends CometExpressionSerde[T] {
  protected def patternOf(expr: T): Expression
  protected def isPatternSafe(expr: T): Boolean =
    patternOf(expr) match {
      case Literal(p, DataTypes.StringType) => RegExp.isSupportedPattern(p.toString)
      case _ => false
    }

  override def getIncompatibleReasons(): Seq[String] = Seq(
    "Uses Rust regexp engine, which has different behavior to Java regexp engine")

  protected def regexSupportLevel(expr: T): SupportLevel =
    patternOf(expr) match {
      case _: Literal if isPatternSafe(expr) => Compatible()
      case _: Literal => Incompatible()
      case _ => Unsupported(Some("Only scalar regexp patterns are supported"))
    }
}

CometRegExpExtract / CometRegExpExtractAll would then add their idx-must-be-literal check on top of regexSupportLevel, and the existing CometRLike / CometRegExpReplace lose their bespoke copies. That also gets the new pair onto the shared regexp allow-incompat key, which I think is what users would expect.

Same idea on the Rust side?

regexp_replace is delegated upstream and RLike is its own PhysicalExpr, so there's no big cross-family dedup available, but within this PR regexp_extract and regexp_extract_all still share their arg parsing, regex compile, group-count validation, subject_len, null_result, and subject dispatch verbatim. A small shared helper that returns the parsed (regex, group_idx, subject) tuple (or an early-exit null result) would leave each UDF as just its own per-row loop:

pub(super) struct ParsedArgs<'a> {
    pub regex: Regex,
    pub group_idx: usize,
    pub subject: &'a ColumnarValue,
}

pub(super) enum ParseResult<'a> {
    Parsed(ParsedArgs<'a>),
    NullResult(ColumnarValue), // null pattern or null idx short-circuit
}

pub(super) fn parse_regexp_extract_args<'a>(
    fn_name: &'static str,
    args: &'a [ColumnarValue],
) -> DataFusionResult<ParseResult<'a>> { ... }

`LargeUtf8` result type in `regexp_extract`

extract_array::<O> is generic over the offset width, so a LargeUtf8 input array yields a LargeStringArray, but the serde declares the return type as expr.dataType (Utf8). Is that intentional, or worth always returning Utf8 (the scalar branch already does)? Something like:

DataType::Utf8 | DataType::LargeUtf8 => {
    let strings: &dyn Array = array.as_ref();
    Ok(ColumnarValue::Array(extract_array(strings, &regex, group_idx)))
}

with extract_array always producing a StringArray (i32 offsets), since &str slices are width-agnostic. regexp_extract_all looks consistent here since its value builder is always StringBuilder.

Possibly related: I didn't see a Rust unit test for a LargeUtf8 subject in either file, would be nice to lock that branch down.

andygrove added 3 commits April 29, 2026 15:59

andygrove marked this pull request as ready for review April 30, 2026 01:03

andygrove marked this pull request as draft May 6, 2026 13:38

andygrove changed the title ~~feat: support Spark expression regexp_extract~~ feat: add native implementation of regexp_extract May 6, 2026

andygrove added 3 commits May 12, 2026 09:12

test: drop redundant dictionary ConfigMatrix from regexp_extract_enabled

802b4d6

The test data has no duplicate rows, so the parquet.enable.dictionary matrix produces two identical runs.

Merge remote-tracking branch 'apache/main' into feat/regexp-extract

0165b05

# Conflicts: # spark/src/main/scala/org/apache/comet/serde/strings.scala

andygrove marked this pull request as ready for review May 27, 2026 20:42

andygrove changed the title ~~feat: add native implementation of regexp_extract~~ feat: add native implementations of regexp_extract and regexp_extract_all May 27, 2026

andygrove requested review from comphead, mbutrovich and parthchandra May 27, 2026 20:51

andygrove marked this pull request as draft May 28, 2026 19:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add native implementations of `regexp_extract` and `regexp_extract_all`#4146

feat: add native implementations of `regexp_extract` and `regexp_extract_all`#4146
andygrove wants to merge 6 commits into
apache:mainfrom
andygrove:feat/regexp-extract

andygrove commented Apr 29, 2026 •

edited

Loading

Uh oh!

mbutrovich commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

andygrove commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

mbutrovich commented May 28, 2026

Could we lift this into a CometRegExpBase?

Same idea on the Rust side?

LargeUtf8 result type in regexp_extract

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

andygrove commented Apr 29, 2026 •

edited

Loading

Could we lift this into a `CometRegExpBase`?

`LargeUtf8` result type in `regexp_extract`